1 Lead

For the following questions, I am going to bring a number of Stock and ETF files into one single data frame, and inspect for further analyses in the future.
Especially, for each Stock and ETF data, I will examine whether there are anomalous values and, if so, why those anomalies happened in the data.
In addition, I will conduct feature engineering for later analyses.

2 Analysis and R code

2.1 Required Libraries

library(tidyverse)
library(RcppRoll)
library(plotly)

2.2 Required Files

stock_list <-list.files("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/Stocks")
etf_list <-list.files("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/ETFs")
length(stock_list);length(etf_list)
## [1] 7196
## [1] 1345

There are total \(7195\) files in Stock file and \(1345\) files in ETFs file.
However, I was able to find some blank files (with no data at all) in Stock ones.
As such, I applied additional file.info(file)$size > 0 when making a code to bring Stock files.

# Make one common Stock File
setwd("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/Stocks") 

for (file in stock_list){
  
    if (!exists("stock_data")){
      stock_data <- read_csv(file)
      stock_data$Symbol <-gsub('(.*).us.txt','\\1',file)
      write_csv(stock_data,"stock_data.csv", col_names = TRUE)
    }
 
    if (exists("stock_data") & file.info(file)$size > 0){
      temp_data <-read_csv(file)
      temp_data$Symbol <- gsub('(.*).us.txt','\\1',file)
      write_csv(temp_data, "stock_data.csv", append = TRUE)
  }
}

stock_DF<-read_csv("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/Stocks/stock_data.csv")
dim(stock_DF)
## [1] 27640460        8
# Make one common ETF File
setwd("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/ETFs")

for (file in etf_list){
       
  if (!exists("etf_data")){
    etf_data <- read_csv(file)
    etf_data$Symbol <-gsub('(.*).us.txt','\\1',file)
    write_csv(etf_data,"etf_data.csv", col_names = TRUE)
  }
   
  if (exists("etf_data")){
    temp_data <-read_csv(file)
    temp_data$Symbol <- gsub('(.*).us.txt','\\1',file)
    write_csv(temp_data, "etf_data.csv", append = TRUE)
  }
}

etf_DF<-read_csv("/Users/Jung-yerin/Desktop/SUMMER 2018/AD/Data/ETFs/etf_data.csv")
dim(etf_DF)
## [1] 3112903       8

2.3 Stock Data

2.3.1 Exploratory

summary(stock_DF)
##       Date                 Open                High          
##  Min.   :1962-01-02   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:2007-12-03   1st Qu.:8.000e+00   1st Qu.:8.000e+00  
##  Median :2012-02-17   Median :1.600e+01   Median :1.600e+01  
##  Mean   :2010-06-20   Mean   :3.268e+04   Mean   :3.356e+04  
##  3rd Qu.:2015-05-20   3rd Qu.:2.900e+01   3rd Qu.:2.900e+01  
##  Max.   :2017-11-10   Max.   :1.424e+09   Max.   :1.442e+09  
##                       NA's   :4246        NA's   :4854       
##       Low                 Close               Volume             OpenInt 
##  Min.   :        -1   Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0  
##  1st Qu.:         8   1st Qu.:8.000e+00   1st Qu.:3.239e+04   1st Qu.:0  
##  Median :        15   Median :1.600e+01   Median :1.900e+05   Median :0  
##  Mean   :     31576   Mean   :3.253e+04   Mean   :1.587e+06   Mean   :0  
##  3rd Qu.:        28   3rd Qu.:2.900e+01   3rd Qu.:8.788e+05   3rd Qu.:0  
##  Max.   :1362117844   Max.   :1.438e+09   Max.   :2.070e+09   Max.   :0  
##  NA's   :4694         NA's   :8114        NA's   :4                      
##     Symbol         
##  Length:27640460   
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

I found several NAs over four columns. As such, I decided to examine and remove them, unless they are critical.

# Five examples with NA in the `High` column 
stock_DF %>% filter(is.na(High)) %>% head()
## # A tibble: 6 x 8
##   Date        Open  High   Low Close Volume OpenInt Symbol
##   <date>     <dbl> <dbl> <dbl> <dbl>  <int>   <int> <chr> 
## 1 2012-04-05   453    NA   441    NA   8391       0 aezs  
## 2 2012-04-09   498    NA   462   480   7694       0 aezs  
## 3 2012-04-13   414    NA   396   402   2848       0 aezs  
## 4 2012-04-18   378    NA    NA    NA   2446       0 aezs  
## 5 2012-04-19    NA    NA    NA   375   1803       0 aezs  
## 6 2012-04-24    NA    NA    NA    NA   1670       0 aezs

So, I will remove those NAs and Low==-1 that we have already detected during the first class.

stock <- stock_DF[complete.cases(stock_DF),]
stock <-stock[stock$Low!=-1,]
stock$Symbol <- toupper(stock$Symbol) #Just because UPPER Symbol is more familiar
summary(stock)
##       Date                 Open                High          
##  Min.   :1962-01-02   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:2007-11-30   1st Qu.:8.000e+00   1st Qu.:8.000e+00  
##  Median :2012-02-16   Median :1.600e+01   Median :1.600e+01  
##  Mean   :2010-06-20   Mean   :3.268e+04   Mean   :3.356e+04  
##  3rd Qu.:2015-05-20   3rd Qu.:2.900e+01   3rd Qu.:2.900e+01  
##  Max.   :2017-11-10   Max.   :1.424e+09   Max.   :1.442e+09  
##       Low                Close               Volume             OpenInt 
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0  
##  1st Qu.:8.000e+00   1st Qu.:8.000e+00   1st Qu.:3.240e+04   1st Qu.:0  
##  Median :1.500e+01   Median :1.600e+01   Median :1.901e+05   Median :0  
##  Mean   :3.158e+04   Mean   :3.252e+04   Mean   :1.587e+06   Mean   :0  
##  3rd Qu.:2.800e+01   3rd Qu.:2.900e+01   3rd Qu.:8.792e+05   3rd Qu.:0  
##  Max.   :1.362e+09   Max.   :1.438e+09   Max.   :2.070e+09   Max.   :0  
##     Symbol         
##  Length:27630232   
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Now, I am going to explore the data graphically.

stock %>% 
  filter(Symbol=='AAPL'|Symbol=="IBM"|Symbol=="GOOGL")%>%
  filter(Date>"2010-01-01")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=High), colour='blue')+
  geom_line(aes(y=Low), colour='grey')+
  ylab(label="High and Low")+
  facet_wrap(~Symbol, scales="free_y")+
  ggtitle("Three Companies' Stocks Trends over time (High&Low)")

Firstly, I looked at the three major tech companies’ stock prices (Daily High & Low) with adjusted scales.
I was able to find that when one’s stock price shows a rising tendency, it tends to have smaller gap between High and Low prices.

stock%>%
  ggplot(aes(x=Volume))+geom_histogram()+
  ggtitle("Stocks Volume Histogram")

stock %>%
  ggplot(aes(x=log(Volume)))+geom_histogram()+
  ggtitle("Stocks log(Volume) Histogram")

As we examined during the class, since the Volumne variable is highly right skewed, we would better to use log(Volume).

2.3.2 Feature Engineering

I decided to do feature engineering to see more general trends from stock data.

stock <- stock %>%
  group_by(Symbol) %>% 
  mutate(Open_Change=Open-lag(Open),
         High_Change=High-lag(High),
         Low_Change=Low-lag(Low),
         Close_Change=Close-lag(Close),
         Volume_Change=Volume-lag(Volume)) %>%
  mutate(Open_PctChange = (Open/lag(Open)-1) * 100, 
         High_PctChange= (High/lag(High)-1) * 100,
         Low_PctChange= (Low/lag(Low)-1) * 100,
         Close_PctChange= (Close/lag(Close)-1) * 100,
         Volume_PctChange= (Volume/lag(Volume)-1) * 100) %>% 
  mutate(Open_Mean30=roll_mean(Open, 30, na.rm=TRUE, align="right", fill = NA),
         High_Mean30=roll_mean(High, 30, na.rm=TRUE, align="right", fill = NA),
         Low_Mean30=roll_mean(Low, 30, na.rm=TRUE, align="right", fill = NA),
         Close_Mean30=roll_mean(Close, 30, na.rm=TRUE, align="right", fill = NA),
         Volume_Mean30=roll_mean(Volume, 30, na.rm=TRUE, align="right", fill = NA)) %>% 
  mutate(Open_Ch_Mean30=roll_mean(Open_Change, 30, na.rm=TRUE, align="right", fill = NA),
         High_Ch_Mean30=roll_mean(High_Change, 30, na.rm=TRUE, align="right", fill = NA),
         Low_Ch_Mean30=roll_mean(Low_Change, 30, na.rm=TRUE, align="right", fill = NA),
         Close_Ch_Mean30=roll_mean(Close_Change, 30, na.rm=TRUE, align="right", fill = NA),
         Volumne_Ch_Mean30=roll_mean(Volume_Change, 30, na.rm=TRUE, align="right", fill = NA)) %>%
  mutate(Open_PctCh_Mean30=roll_mean(Open_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         High_PctCh_Mean30=roll_mean(High_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         Low_PctCh_Mean30=roll_mean(Low_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         Close_PctCh_Mean30=roll_mean(Close_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         Volumne_PctCh_Mean30=roll_mean(Volume_PctChange, 30, na.rm=TRUE, align="right", fill = NA)) %>%
  ungroup()

tail(stock)
## # A tibble: 6 x 33
##   Date        Open  High   Low Close Volume OpenInt Symbol Open_Change
##   <date>     <dbl> <dbl> <dbl> <dbl>  <int>   <int> <chr>        <dbl>
## 1 2017-11-03  9.83  10.5  9.83  10.3 531495       0 ZYNE         0.240
## 2 2017-11-06 10.4   11.5 10.4   11.2 977948       0 ZYNE         0.590
## 3 2017-11-07 11.3   11.4 10.7   10.8 451210       0 ZYNE         0.880
## 4 2017-11-08 10.7   11.1 10.4   10.9 336449       0 ZYNE        -0.600
## 5 2017-11-09 11.0   11.9 11.0   11.6 463067       0 ZYNE         0.300
## 6 2017-11-10 11.7   13.2 11.3   12.5 885587       0 ZYNE         0.680
## # ... with 24 more variables: High_Change <dbl>, Low_Change <dbl>,
## #   Close_Change <dbl>, Volume_Change <int>, Open_PctChange <dbl>,
## #   High_PctChange <dbl>, Low_PctChange <dbl>, Close_PctChange <dbl>,
## #   Volume_PctChange <dbl>, Open_Mean30 <dbl>, High_Mean30 <dbl>,
## #   Low_Mean30 <dbl>, Close_Mean30 <dbl>, Volume_Mean30 <dbl>,
## #   Open_Ch_Mean30 <dbl>, High_Ch_Mean30 <dbl>, Low_Ch_Mean30 <dbl>,
## #   Close_Ch_Mean30 <dbl>, Volumne_Ch_Mean30 <dbl>,
## #   Open_PctCh_Mean30 <dbl>, High_PctCh_Mean30 <dbl>,
## #   Low_PctCh_Mean30 <dbl>, Close_PctCh_Mean30 <dbl>,
## #   Volumne_PctCh_Mean30 <dbl>
stock %>% 
  filter(Symbol=='AAPL'|Symbol=='GOOGL'|Symbol=="IBM")%>%
  filter(Date>"2010-01-01")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=High_Mean30), color="blue")+
  geom_line(aes(y=Low_Mean30), color="grey")+
  ylab(label="High_Mean30 and Low_Mean30")+
  facet_wrap(~Symbol, scales="free_y")+
  ggtitle("Three Companies' Mean High & Low Prices every 30 Days")

Compare to the previous Open price plot(“Three Companies’ Stocks Trends over time (High&Low)),
now the plot has way less granularity with maintaining the general trends.
As such, I am going to keep using Open_Mean30 instead of Open to find interesting points from the data.

stock %>% 
  filter(Symbol=='AAPL'|Symbol=='GOOGL'|Symbol=="IBM")%>%
  filter(Date>"2016-01-01")%>%
  ggplot(aes(x=Date, y=Open_Ch_Mean30, color=Symbol))+
  geom_line()+
  ggtitle("Three Companies' Avg. Change in Open Prices every 30 Days")

stock %>% 
  filter(Symbol=='AAPL'|Symbol=='GOOGL'|Symbol=="IBM")%>%
  filter(Date>"2016-01-01")%>%
  ggplot(aes(x=Date, y=Open_PctCh_Mean30, color=Symbol))+
  geom_line()+
  ggtitle("Three Companies' Avg. % Change in Open Prices every 30 Days")

In addition to making XXXX_Mean30, I also added average change every 30 days, such as Open_Ch_Mean30, and average % change every 30 days, such as Open_PctCh_Mean30.
By doing so, I wanted to see the general fluctuation changes of stock prices.

From the first plot, it looks like GOOGL‘s open stock price fluctuates the most. While this is true, though, it is mainly because the company has the biggest stock quantity.
As we can see from the second plot, which is three tech companies’ average % change in stock prices, the companies tend to show similar trends, except that IBM has a very different tendency during the first and second quarter of 2017. We may want to examine what happend during that time.

stock %>%
  filter(Date>="2008-05-01" & Date <="2009-12-31") %>%
  filter(Symbol=='XL'|Symbol=="GNW") %>%
  ggplot(aes(x=Date, y=Open_Mean30, color=Symbol))+
  geom_line()+
  ggtitle("Stocks that Crashed in 2008")+
  geom_vline(xintercept = as.Date(c("2008-09-01","2008-10-31")))

stock %>%
  filter(Date>="2008-05-01" & Date <="2009-12-31") %>%
  filter(Symbol=='AMGN'|Symbol=="WMT") %>%
  ggplot(aes(x=Date, y=Open_Mean30, color=Symbol))+
  geom_line()+
  ggtitle("Stocks that Thrived in 2008")+
  geom_vline(xintercept = as.Date(c("2008-09-01","2008-09-30")))

The first plot depicts two stocks that performed the worst during the recession, and the second one describes two stocks that relatively thrived during the recession in 2008.
From further analyses, we might be able to analyze the reasons why some stocks especially crashed, while the others survived well.

2.4 ETF Data

2.4.1 Exploratory

summary(etf_DF)
##       Date                 Open               High         
##  Min.   :1999-03-10   Min.   :       0   Min.   :       0  
##  1st Qu.:2010-12-17   1st Qu.:      23   1st Qu.:      23  
##  Median :2013-09-24   Median :      35   Median :      35  
##  Mean   :2013-03-25   Mean   :    1860   Mean   :    1954  
##  3rd Qu.:2015-11-10   3rd Qu.:      55   3rd Qu.:      55  
##  Max.   :2017-11-10   Max.   :34116000   Max.   :37152000  
##                       NA's   :745        NA's   :752       
##       Low               Close              Volume             OpenInt 
##  Min.   :       0   Min.   :       0   Min.   :0.000e+00   Min.   :0  
##  1st Qu.:      22   1st Qu.:      23   1st Qu.:9.002e+03   1st Qu.:0  
##  Median :      34   Median :      35   Median :4.489e+04   Median :0  
##  Mean   :    1782   Mean   :    1854   Mean   :1.427e+06   Mean   :0  
##  3rd Qu.:      54   3rd Qu.:      55   3rd Qu.:2.372e+05   3rd Qu.:0  
##  Max.   :33696000   Max.   :35304000   Max.   :1.519e+09   Max.   :0  
##  NA's   :734        NA's   :445                                       
##     Symbol         
##  Length:3112903    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
etfs <- etf_DF[complete.cases(etf_DF),]
etfs$Symbol <- toupper(etfs$Symbol)
summary(etfs) #After remove NAs and adjust Symbols
##       Date                 Open               High         
##  Min.   :1999-03-10   Min.   :       0   Min.   :       0  
##  1st Qu.:2010-12-17   1st Qu.:      23   1st Qu.:      23  
##  Median :2013-09-24   Median :      35   Median :      35  
##  Mean   :2013-03-25   Mean   :    1860   Mean   :    1954  
##  3rd Qu.:2015-11-10   3rd Qu.:      55   3rd Qu.:      55  
##  Max.   :2017-11-10   Max.   :34116000   Max.   :37152000  
##       Low               Close              Volume             OpenInt 
##  Min.   :       0   Min.   :       0   Min.   :0.000e+00   Min.   :0  
##  1st Qu.:      22   1st Qu.:      23   1st Qu.:9.000e+03   1st Qu.:0  
##  Median :      34   Median :      35   Median :4.485e+04   Median :0  
##  Mean   :    1782   Mean   :    1854   Mean   :1.426e+06   Mean   :0  
##  3rd Qu.:      54   3rd Qu.:      55   3rd Qu.:2.369e+05   3rd Qu.:0  
##  Max.   :33696000   Max.   :35304000   Max.   :1.519e+09   Max.   :0  
##     Symbol         
##  Length:3111965    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
etfs %>%
  filter(Open>10000000)
## # A tibble: 218 x 8
##    Date           Open     High      Low    Close Volume OpenInt Symbol
##    <date>        <dbl>    <dbl>    <dbl>    <dbl>  <int>   <int> <chr> 
##  1 2010-12-03 22090000 22090000 19530000 19695000      0       0 TVIX  
##  2 2010-12-06 19942500 20000000 18655000 18922500      0       0 TVIX  
##  3 2010-12-07 18750000 18750000 17525000 18532500      0       0 TVIX  
##  4 2010-12-08 18430000 18975000 17500000 17730000      0       0 TVIX  
##  5 2010-12-09 16690000 17370000 16690000 16872500      0       0 TVIX  
##  6 2010-12-10 16655000 16725000 16362500 16587500      0       0 TVIX  
##  7 2010-12-13 16300000 17250000 16300000 17187500      0       0 TVIX  
##  8 2010-12-14 17095000 17600000 16982500 17375000      0       0 TVIX  
##  9 2010-12-15 17300000 18200000 16962500 18200000      0       0 TVIX  
## 10 2010-12-16 18025000 18287500 17447500 17895000      0       0 TVIX  
## # ... with 208 more rows

UVXY and TVIX showed extremely high OHLC prices during the certain period.
UVXY has 8 splits in its history (https://www.splithistory.com/uvxy/), with the first starting on March 08, 2012.
TVIX has 5 splits according to split.history.com(https://www.splithistory.com/?symbol=tvix), with the first on December 21, 2012.

We can also see these two ETFs regarding splits via graphs.

a <- list(text = "First Stock Split",
          x = '2012-03-08',
          y = 1.02,
          xref = 'x',
          yref = 'paper',
          xanchor = 'left',
          showarrow = FALSE
          )

# use shapes to create a line
l <- list(type = line,
          x0 = '2012-03-08',
          x1 = '2012-03-08',
          y0 = 0,
          y1 = 1,
          xref = 'x',
          yref = 'paper',
          line = list(color = 'black',
                      width = 0.5)
          )
p <- etfs %>%
  filter(Symbol=="UVXY")%>%
  plot_ly(x = ~Date, type="ohlc",
          open = ~Open, close = ~Close,
          high = ~High, low = ~Low) %>%
  layout(title = "UVXY With the First Stock Split",
         annotations = a,
         shapes = l, 
         xaxis = list(rangeslider = list(visible = F)))

p
a <- list(text = "First Stock Split",
          x = '2012-12-21',
          y = 1.02,
          xref = 'x',
          yref = 'paper',
          xanchor = 'left',
          showarrow = FALSE
          )

# use shapes to create a line
l <- list(type = line,
          x0 = '2012-12-21',
          x1 = '2012-12-21',
          y0 = 0,
          y1 = 1,
          xref = 'x',
          yref = 'paper',
          line = list(color = 'black',
                      width = 0.5)
          )
p <- etfs %>%
  filter(Symbol=="TVIX")%>%
  plot_ly(x = ~Date, type="ohlc",
          open = ~Open, close = ~Close,
          high = ~High, low = ~Low) %>%
  layout(title = "TVIX With the First Stock Split",
         annotations = a,
         shapes = l, 
         xaxis = list(rangeslider = list(visible = F)))

p
etfs %>%
  filter(Volume>1e+08)
## # A tibble: 6,964 x 8
##    Date        Open  High   Low Close    Volume OpenInt Symbol
##    <date>     <dbl> <dbl> <dbl> <dbl>     <int>   <int> <chr> 
##  1 2008-11-21  12.0  13.1  11.2  13.0 137101117       0 DDM   
##  2 2008-11-21  66.2  69.3  63.9  69.1 106636496       0 DIA   
##  3 2007-02-27  33.3  34.5  30.9  31.8 101592969       0 EEM   
##  4 2007-07-26  40.4  40.4  38.2  39.2 112175189       0 EEM   
##  5 2007-08-01  38.6  39.2  38.1  38.9 120207365       0 EEM   
##  6 2007-08-16  34.8  35.3  32.8  34.9 158324089       0 EEM   
##  7 2007-08-17  36.2  37.4  34.9  36.0 112881648       0 EEM   
##  8 2007-11-28  43.8  45.7  43.8  45.7 114098336       0 EEM   
##  9 2008-01-08  43.4  43.9  42.5  42.6 100347472       0 EEM   
## 10 2008-01-15  43.0  43.2  41.8  41.9 102457098       0 EEM   
## # ... with 6,954 more rows

All data do not seem to be technically abnormal.

Now, I am going to explore the data graphically.

etfs %>% 
  filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
  filter(Date>"2005-01-01")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=High), color="blue")+
  geom_line(aes(y=Low), color="grey")+
  facet_wrap(~Symbol, scales="free_y")+
  ylab(label="High and Low")+
  ggtitle("Three ETFs High & Low Prices")

This time, I looked at three famous ETF’s High and Low prices.
Generally, ETFs’ prices had a slightly larger gap between High and Low than Stocks’ ones.

Interestingly, I was able to find a strange looking High price from XLF, and decide to examine it.

etfs %>%
  filter(Symbol=="XLF", High>40)
## # A tibble: 1 x 8
##   Date        Open  High   Low Close    Volume OpenInt Symbol
##   <date>     <dbl> <dbl> <dbl> <dbl>     <int>   <int> <chr> 
## 1 2011-06-16  10.7  52.7  10.6  10.7 186227724       0 XLF

There was only one day during the entire period that XLF’s High price was over 50. I seached the High price online (From Yahoo Finance), and was not able to find any price over 50.
As such, I assume that this was an error, and need to be adjusted.

etfs %>%
  ggplot(aes(x=Volume))+geom_histogram()+
  ggtitle("ETFs Volume Histogram")

etfs %>%
  ggplot(aes(x=log(Volume)))+geom_histogram()+
  ggtitle("ETFs log(Volume) Histogram")

Like Stocks data, since the Volume is highly right skewed, it is better to see log(Volume).

2.4.2 Feature Engineering

I decided to do feature engineering to see more general trends from ETFs data.

etfs <- etfs %>%
  group_by(Symbol) %>% 
  mutate(Open_Change=Open-lag(Open),
         High_Change=High-lag(High),
         Low_Change=Low-lag(Low),
         Close_Change=Close-lag(Close),
         Volume_Change=Volume-lag(Volume)) %>%
  mutate(Open_PctChange = (Open/lag(Open)-1) * 100, 
         High_PctChange= (High/lag(High)-1) * 100,
         Low_PctChange= (Low/lag(Low)-1) * 100,
         Close_PctChange= (Close/lag(Close)-1) * 100,
         Volume_PctChange= (Volume/lag(Volume)-1) * 100) %>% 
  mutate(Open_Mean30=roll_mean(Open, 30, na.rm=TRUE, align="right", fill = NA),
         High_Mean30=roll_mean(High, 30, na.rm=TRUE, align="right", fill = NA),
         Low_Mean30=roll_mean(Low, 30, na.rm=TRUE, align="right", fill = NA),
         Close_Mean30=roll_mean(Close, 30, na.rm=TRUE, align="right", fill = NA),
         Volume_Mean30=roll_mean(Volume, 30, na.rm=TRUE, align="right", fill = NA)) %>% 
  mutate(Open_Ch_Mean30=roll_mean(Open_Change, 30, na.rm=TRUE, align="right", fill = NA),
         High_Ch_Mean30=roll_mean(High_Change, 30, na.rm=TRUE, align="right", fill = NA),
         Low_Ch_Mean30=roll_mean(Low_Change, 30, na.rm=TRUE, align="right", fill = NA),
         Close_Ch_Mean30=roll_mean(Close_Change, 30, na.rm=TRUE, align="right", fill = NA),
         Volumne_Ch_Mean30=roll_mean(Volume_Change, 30, na.rm=TRUE, align="right", fill = NA)) %>%
  mutate(Open_PctCh_Mean30=roll_mean(Open_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         High_PctCh_Mean30=roll_mean(High_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         Low_PctCh_Mean30=roll_mean(Low_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         Close_PctCh_Mean30=roll_mean(Close_PctChange, 30, na.rm=TRUE, align="right", fill = NA),
         Volumne_PctCh_Mean30=roll_mean(Volume_PctChange, 30, na.rm=TRUE, align="right", fill = NA)) %>%
  ungroup()

tail(etfs)
## # A tibble: 6 x 33
##   Date        Open  High   Low Close Volume OpenInt Symbol Open_Change
##   <date>     <dbl> <dbl> <dbl> <dbl>  <int>   <int> <chr>        <dbl>
## 1 2017-11-03  30.7  32.1  30.6  31.8  37680       0 ZSL          0    
## 2 2017-11-06  31.8  31.8  30.4  30.4  64802       0 ZSL          1.07 
## 3 2017-11-07  31.4  31.6  31.4  31.4  24947       0 ZSL         -0.400
## 4 2017-11-08  31.0  31.3  30.4  31.3  48771       0 ZSL         -0.410
## 5 2017-11-09  31.5  31.5  31.2  31.3  34983       0 ZSL          0.520
## 6 2017-11-10  31.2  32.1  31.0  31.7  34123       0 ZSL         -0.320
## # ... with 24 more variables: High_Change <dbl>, Low_Change <dbl>,
## #   Close_Change <dbl>, Volume_Change <int>, Open_PctChange <dbl>,
## #   High_PctChange <dbl>, Low_PctChange <dbl>, Close_PctChange <dbl>,
## #   Volume_PctChange <dbl>, Open_Mean30 <dbl>, High_Mean30 <dbl>,
## #   Low_Mean30 <dbl>, Close_Mean30 <dbl>, Volume_Mean30 <dbl>,
## #   Open_Ch_Mean30 <dbl>, High_Ch_Mean30 <dbl>, Low_Ch_Mean30 <dbl>,
## #   Close_Ch_Mean30 <dbl>, Volumne_Ch_Mean30 <dbl>,
## #   Open_PctCh_Mean30 <dbl>, High_PctCh_Mean30 <dbl>,
## #   Low_PctCh_Mean30 <dbl>, Close_PctCh_Mean30 <dbl>,
## #   Volumne_PctCh_Mean30 <dbl>
etfs %>% 
  filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
  filter(Date>"2005-01-01")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=High_Mean30), color="blue")+
  geom_line(aes(y=Low_Mean30), color="grey")+
  facet_wrap(~Symbol, scales="free_y")+
  ylab(label="High_Mean30 and Low_Mean30")+
  ggtitle("Three ETFs' Mean High & Low Prices every 30 Days")

Just like the Stocks case, now the plot has less granularity.

etfs %>% 
  filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
  filter(Date>"2005-01-01")%>%
  ggplot(aes(x=Date, y=Open_Ch_Mean30, color=Symbol))+
  geom_line()+
  ggtitle("Three ETFs' Avg. Change in Open Prices every 30 Days")

etfs %>% 
  filter(Symbol=='SPY'|Symbol=='EEM'|Symbol=="XLF")%>%
  filter(Date>"2005-01-01")%>%
  ggplot(aes(x=Date, y=Open_PctCh_Mean30, color=Symbol))+
  geom_line()+
  ggtitle("Three ETFs' Avg. % Change in Open Prices every 30 Days")

As we can see from the above two plots, those three ETFs’ stocks have a very similar tendency over time.
Although, there is an ETF that fluctuates more than the others(XLF), they mostly move along together.

3 Conclusion

  1. Stock prices might need to be adjusted. Stocks can be split and dividends also affect the nominal value of stock price. As such, for more accurate analysis later, we need to adjust prices accordingly.
  2. There were some stocks that evidently performed the worst, and others that relatively thrived in the recession in 2008. We might want to explore the reasons why.
  3. The features that I have engineered will help me understand more general trends, with ignoring trivial daily errors.
    - XXX_Change(Open_Change, Close_Change, and so on) and XXX_PctChange will help me understand how a stock change daily.
    - XXX_Mean30 will help me see the more general trend of a stock, by examining change in every 30 days.
    - XXX_Ch_Mean30 and XXX_PctCh_Mean30 will help me understand which stocks fluctuate severely than others.
  4. The current engineered features will have synergy effects once we can get to access external data, such as news or companies’ industry and so on.